# Logos
_Create ridiculously fast Lexers._
**Logos** has two goals:
+ To make it easy to create a Lexer, so you can focus on more complex problems.
+ To make the generated Lexer faster than anything you'd write by hand.
To achieve those, **Logos**:
+ Combines all token definitions into a single [deterministic state machine](https://en.wikipedia.org/wiki/Deterministic_finite_automaton).
+ Optimizes branches into [lookup tables](https://en.wikipedia.org/wiki/Lookup_table) or [jump tables](https://en.wikipedia.org/wiki/Branch_table).
+ Prevents [backtracking](https://en.wikipedia.org/wiki/ReDoS) inside token definitions.
+ [Unwinds loops](https://en.wikipedia.org/wiki/Loop_unrolling), and batches reads to minimize bounds checking.
+ Does all of that heavy lifting at compile time.
## Example
```rust
use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
enum Token {
    // Tokens can be literal strings, of any length.
    #[token("fast")]
    Fast,

    #[token(".")]
    Period,

    // Or regular expressions.
    #[regex("[a-zA-Z]+")]
    Text,

    // Logos requires one token variant to handle errors,
    // it can be named anything you wish.
    #[error]
    // We can also use this variant to define whitespace,
    // or any other matches we wish to skip.
    #[regex(r"[ \t\n\f]+", logos::skip)]
    Error,
}

fn main() {
    let mut lex = Token::lexer("Create ridiculously fast Lexers.");

    assert_eq!(lex.next(), Some(Token::Text));
    assert_eq!(lex.span(), 0..6);
    assert_eq!(lex.slice(), "Create");

    assert_eq!(lex.next(), Some(Token::Text));
    assert_eq!(lex.span(), 7..19);
    assert_eq!(lex.slice(), "ridiculously");

    assert_eq!(lex.next(), Some(Token::Fast));
    assert_eq!(lex.span(), 20..24);
    assert_eq!(lex.slice(), "fast");

    assert_eq!(lex.next(), Some(Token::Text));
    assert_eq!(lex.span(), 25..31);
    assert_eq!(lex.slice(), "Lexers");

    assert_eq!(lex.next(), Some(Token::Period));
    assert_eq!(lex.span(), 31..32);
    assert_eq!(lex.slice(), ".");

    assert_eq!(lex.next(), None);
}
```
### Callbacks
**Logos** can also call arbitrary functions whenever a pattern is matched,
which can be used to put data into a variant:
```rust
use logos::{Logos, Lexer};

// Note: callbacks can return `Option` or `Result`
fn kilo(lex: &mut Lexer<Token>) -> Option<u64> {
    let slice = lex.slice();
    let n: u64 = slice[..slice.len() - 1].parse().ok()?; // skip 'k'
    Some(n * 1_000)
}

fn mega(lex: &mut Lexer<Token>) -> Option<u64> {
    let slice = lex.slice();
    let n: u64 = slice[..slice.len() - 1].parse().ok()?; // skip 'm'
    Some(n * 1_000_000)
}

#[derive(Logos, Debug, PartialEq)]
enum Token {
    #[regex(r"[ \t\n\f]+", logos::skip)]
    #[error]
    Error,

    // Callbacks can use closure syntax, or refer
    // to a function defined elsewhere.
    //
    // Each pattern can have its own callback.
    #[regex("[0-9]+", |lex| lex.slice().parse())]
    #[regex("[0-9]+k", kilo)]
    #[regex("[0-9]+m", mega)]
    Number(u64),
}

fn main() {
    let mut lex = Token::lexer("5 42k 75m");

    assert_eq!(lex.next(), Some(Token::Number(5)));
    assert_eq!(lex.slice(), "5");

    assert_eq!(lex.next(), Some(Token::Number(42_000)));
    assert_eq!(lex.slice(), "42k");

    assert_eq!(lex.next(), Some(Token::Number(75_000_000)));
    assert_eq!(lex.slice(), "75m");

    assert_eq!(lex.next(), None);
}
```
**Logos** can handle callbacks with the following return types:

| Return type                       | Produces                                            |
|-----------------------------------|-----------------------------------------------------|
| `()`                              | `Token::Unit`                                       |
| `bool`                            | `Token::Unit` **or** `<Token as Logos>::ERROR`      |
| `Result<(), _>`                   | `Token::Unit` **or** `<Token as Logos>::ERROR`      |
| `T`                               | `Token::Value(T)`                                   |
| `Option<T>`                       | `Token::Value(T)` **or** `<Token as Logos>::ERROR`  |
| `Result<T, _>`                    | `Token::Value(T)` **or** `<Token as Logos>::ERROR`  |
| [`Skip`](./struct.Skip.html)      | _skips matched input_                               |
| [`Filter<T>`](./enum.Filter.html) | `Token::Value(T)` **or** _skips matched input_      |
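For instance, here is a minimal sketch of the [`Filter<T>`](./enum.Filter.html) row. The `even_only` callback and the even/odd rule are just an illustration, not part of the crate:

```rust
use logos::{Filter, Lexer, Logos};

// A hypothetical callback: emit even numbers, silently skip odd ones.
fn even_only(lex: &mut Lexer<Token>) -> Filter<u64> {
    let n: u64 = lex.slice().parse().unwrap();

    if n % 2 == 0 {
        Filter::Emit(n)
    } else {
        Filter::Skip
    }
}

#[derive(Logos, Debug, PartialEq)]
enum Token {
    #[regex(r"[ \t\n\f]+", logos::skip)]
    #[error]
    Error,

    #[regex("[0-9]+", even_only)]
    Even(u64),
}

fn main() {
    let mut lex = Token::lexer("1 2 3 4");

    // Odd numbers match the pattern but are filtered out.
    assert_eq!(lex.next(), Some(Token::Even(2)));
    assert_eq!(lex.next(), Some(Token::Even(4)));
    assert_eq!(lex.next(), None);
}
```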
Callbacks can also be used to perform more specialized lexing in places
where regular expressions are too limiting. For specifics, look at
[`Lexer::remainder`](./struct.Lexer.html#method.remainder) and
[`Lexer::bump`](./struct.Lexer.html#method.bump).
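As a minimal sketch of what that can look like, here we consume a block comment by hand. The `block_comment` callback and the `/* ... */` syntax are our own illustration, not an API of the crate:

```rust
use logos::{Lexer, Logos, Skip};

// A hypothetical hand-rolled sub-lexer: scan the unprocessed input with
// `remainder`, advance past the comment with `bump`, then skip the match.
fn block_comment(lex: &mut Lexer<Token>) -> Skip {
    match lex.remainder().find("*/") {
        // Consume everything up to and including the closing `*/`.
        Some(end) => lex.bump(end + 2),
        // Unterminated comment: consume the rest of the input.
        None => lex.bump(lex.remainder().len()),
    }
    Skip
}

#[derive(Logos, Debug, PartialEq)]
enum Token {
    #[regex(r"[ \t\n\f]+", logos::skip)]
    #[error]
    Error,

    #[token("/*", block_comment)]
    BlockComment,

    #[regex("[a-zA-Z]+")]
    Text,
}

fn main() {
    let mut lex = Token::lexer("hello /* comment */ world");

    assert_eq!(lex.next(), Some(Token::Text)); // "hello"
    assert_eq!(lex.next(), Some(Token::Text)); // "world"
    assert_eq!(lex.next(), None);
}
```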
## Token disambiguation
The rule of thumb is:
+ Longer beats shorter.
+ Specific beats generic.
If any two definitions could match the same input, like `fast` and `[a-zA-Z]+`
in the example above, it's the longer and more specific definition of `Token::Fast`
that will be the result.
This is done by comparing the numeric priority attached to each definition. Every consecutive,
non-repeating single byte adds 2 to the priority, while every range or regex class adds 1.
Loops and optional blocks are ignored, while alternations count the shortest alternative
(a short sketch follows the list):
+ `[a-zA-Z]+` has a priority of 1 (lowest possible), because at minimum it can match a single byte to a class.
+ `foobar` has a priority of 12.
+ `(foo|hello)(bar)?` has a priority of 6, `foo` being its shortest possible match.
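A quick sketch of those rules in action, reusing the tokens from the first example: `fast` has a priority of 8 (4 bytes × 2), so it beats `[a-zA-Z]+` (priority 1) on exactly `fast`, while the longer match still wins everywhere else:

```rust
use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
enum Token {
    #[regex(r"[ \t\n\f]+", logos::skip)]
    #[error]
    Error,

    // Priority 8: four consecutive single bytes.
    #[token("fast")]
    Fast,

    // Priority 1: a single regex class, repetition ignored.
    #[regex("[a-zA-Z]+")]
    Text,
}

fn main() {
    let mut lex = Token::lexer("fast faster");

    assert_eq!(lex.next(), Some(Token::Fast)); // specific beats generic
    assert_eq!(lex.next(), Some(Token::Text)); // "faster": longer beats shorter
    assert_eq!(lex.next(), None);
}
```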